In [ ]:
import sklearn
You are able to import this because the module sklearn is already part of the Anaconda distribution.
You can explore the modules that are part of sklearn by typing from sklearn import and then pressing Tab.
In [ ]:
# try it below
from sklearn import
# this also works with submodules
from sklearn.linear_model import
In [ ]:
# from the submodule linear_model, let's import LinearRegression
from sklearn.linear_model import LinearRegression
Python is based on object-oriented programming (OOP). The imported LinearRegression is a class definition. You can find the parent classes of a class by inspecting its __bases__ attribute
In [ ]:
LinearRegression.__bases__
To create an object, you call the class with parameters. To retrieve the possible parameters of a class (or function) in the notebook, you can press Shift-Tab (preview), Shift-Tab twice (expanded window), three times (expanded window with no timeout), or four times (split view of the help)
In [ ]:
# try it below
LinearRegression()
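If you prefer a programmatic alternative to Shift-Tab, the same signature can be retrieved with the standard library's inspect module (inspect.signature is a standard Python function; this is just a sketch of an alternative workflow)
In [ ]:
# show the constructor signature of the class
import inspect
inspect.signature(LinearRegression)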
Now, let's create a linear regression object
In [ ]:
lr = LinearRegression()
Again, we can explore that object by typing the name of the object, then ., and then pressing Tab
In [ ]:
# try it here
lr.
If we type lr into the notebook, we will get a customized description of the object
In [ ]:
lr
We can obtain the class of an object programmatically by calling the built-in type function
In [ ]:
type(lr)
Also, every object has a unique identity, which we can retrieve with the built-in id function
In [ ]:
id(lr)
In [ ]:
from sklearn.datasets import load_diabetes
In [ ]:
diabetes_ds = load_diabetes()
In [ ]:
X = diabetes_ds['data']
y = diabetes_ds['target']
sklearn works mostly with numpy arrays, which are $n$-dimensional arrays.
In [ ]:
[type(X), type(y)]
You can check the number of dimensions of an array
In [ ]:
X.ndim
Check the size of the dimensions
In [ ]:
X.shape
Get slices of the dimensions. The following are all the same thing: grab the first two rows of a matrix
In [ ]:
X[0:2]
In [ ]:
X[:2]
In [ ]:
X[0:2, :]
We can also grab columns in the same way
In [ ]:
X[:, 0:2]
Sometimes you want to grab just one column (feature), but then numpy returns a one-dimensional object
In [ ]:
X[:, 2].shape
We can reshape the $n$-dimensional array and add one dimension:
In [ ]:
X[:, 2].reshape([-1, 1])
In [ ]:
X[:, 2].reshape([-1, 1]).shape
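An equivalent idiom (assuming numpy is imported as np, as is conventional) uses np.newaxis to add the extra dimension
In [ ]:
import numpy as np
# slicing with np.newaxis adds a dimension of size 1
X[:, 2][:, np.newaxis].shape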
You can do matrix algebra:
In [ ]:
# transpose
X.T.shape
In [ ]:
X.dot(X.T).shape
For more functions, you can import numpy's linear algebra module
In [ ]:
import numpy.linalg as la
In [ ]:
la.inv(X.dot(X.T)).shape
OK, let's go back to our example with linear regression.
Usually, sklearn objects start by fitting the data and then either predict or transform new data. Predicting is usually for supervised learning and transforming for unsupervised learning.
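As a small sketch of the transform side of this pattern, here is sklearn's StandardScaler (a real sklearn transformer; the variable names are only illustrative)
In [ ]:
from sklearn.preprocessing import StandardScaler
# transformers follow the same fit-then-transform pattern
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X_scaled.mean(axis=0)  # each column is now approximately zero-mean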
In [ ]:
# explore the parameters of fit
lr.fit
In [ ]:
lr2 = lr.fit(X[:, [2]], y)
fit
returns an object. If we examine the id of the object it returns:
In [ ]:
id(lr2)
In [ ]:
id(lr)
We realize that it is the same object lr: the call fits the data, modifies the internal structure of the object, and returns the object itself.
Therefore, you can chain calls, which is a very powerful feature.
By looking at the online documentation of LinearRegression, we can find out the parameters it learned.
In [ ]:
lr.intercept_
In [ ]:
lr.coef_
In [ ]:
# explore the parameters
lr.predict
In [ ]:
y_pred = lr.predict(X[:, [2]])
Because we know how linear regression works, we can produce the predictions ourselves
In [ ]:
y_pred2 = lr.intercept_ + X[:, [2]].dot(lr.coef_)
In [ ]:
import numpy as np
# this checks that all entries in the comparison are True
np.all(y_pred2 == y_pred)
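Since these are floating-point numbers, exact equality can be fragile in general; np.allclose (a real numpy function) is the usual tolerant comparison
In [ ]:
# compare with a small numerical tolerance
np.allclose(y_pred2, y_pred)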
Now, due to the powerful concept of chaining, we can combine fit and predict in one line
In [ ]:
y_pred3 = lr.fit(X[:, [2]], y).predict(X[:, [2]])
In [ ]:
np.all(y_pred3 == y_pred)
Sometimes you want to use a package that you found online. Many of these packages are available through pip, the Python package installer.
For example, the package quandl allows quants to load financial data in Python.
We can install it in the console simply by typing
pip install quandl
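You can also run the same command from inside the notebook by prefixing it with !, Jupyter's shell escape
In [ ]:
!pip install quandl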
And now we should be able to import that package
In [ ]:
import quandl
In [ ]:
import quandl
mydata = quandl.get("YAHOO/AAPL")
In [ ]:
mydata.head()
In [ ]:
# this displays plot results inline in the notebook
%matplotlib inline
Pandas is a package for loading, manipulating, and displaying data sets. It tries to mimic the functionality of data.frame in R
In [ ]:
import pandas as pd
Many packages return data in pandas
DataFrame
objects
In [ ]:
apple_stocks = quandl.get("YAHOO/AAPL")
In [ ]:
type(apple_stocks)
We can display the beginning of a data frame:
In [ ]:
apple_stocks.head()
In [ ]:
apple_stocks.tail()
We can also plot it with pandas
In [ ]:
apple_stocks.plot(y='Close');
We can manipulate it too. Let's say we want to compute the stock returns
$$ r_t = \frac{V_t - V_{t-1}}{V_{t-1}} = \frac{V_t}{V_{t-1}} - 1$$
For this, we need a rolling computation over the series; pandas provides pct_change for exactly this
In [ ]:
apple_stocks[['Close']].pct_change().head()
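We can double-check that pct_change matches the return formula above by computing it manually with shift, which moves the series by one time step (shift is a real pandas method; this is just an illustrative check)
In [ ]:
close = apple_stocks['Close']
# (V_t - V_{t-1}) / V_{t-1}, computed by hand
((close - close.shift(1)) / close.shift(1)).head()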
In [ ]:
apple_stocks[['Close']].pct_change().plot();
In [ ]:
apple_stocks[['Close']].pct_change().hist(bins=100);
Spark is a distributed in-memory big data analytics framework. It is Hadoop on steroids.
Because we launched this jupyter notebook with pyspark, we automatically have available a variable called sc, the Spark context, which gives us access to the master and therefore to the workers.
If we open the Spark dashboard (usually on port 4040), we can see some of this runtime information.
With the Spark context you can read data from many sources, including HDFS (Hadoop Distributed File System), Hive, Amazon S3, local files, and databases.
In [ ]:
# explore the variables and functions available in the Spark context
sc
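For example, sc.textFile reads a text file into an RDD (textFile is a real SparkContext method; the path below is only illustrative, and the call is lazy, so nothing is read until an action runs)
In [ ]:
# illustrative path; replace with a real HDFS or local file
lines_rdd = sc.textFile('hdfs:///tmp/example.txt')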
Spark usually works with RDDs (Resilient Distributed Datasets) and, more recently, is moving towards DataFrames, which are similar to pandas data frames but distributed.
In [ ]:
rdd_example = sc.parallelize([1, 2, 3, 4, 5, 6, 7])
We can check the id
of the RDD
in the cluster
In [ ]:
rdd_example.id()
In [ ]:
# this is an RDD
type(rdd_example)
Let's explore the functions we have available
In [ ]:
rdd_example.
One such function is take, which allows you to get a taste of what the RDD contains
In [ ]:
rdd_example.take(3)
Let's say you want to apply an operation to each element of the list
In [ ]:
def square(x):
return x**2
Now we can apply that transformation to the RDD with the map function
In [ ]:
rdd_result = rdd_example.map(square)
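Equivalently, map also accepts an anonymous function
In [ ]:
# the same transformation written with a lambda
rdd_example.map(lambda x: x**2)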
You might notice that this returns immediately. This is because operations on RDDs are lazily evaluated
In [ ]:
type(rdd_result)
So rdd_result
is another RDD
In [ ]:
rdd_result.id()
In fact, there is no duplication of data: Spark builds a computational graph that keeps track of dependencies and recomputes them if something crashes.
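You can peek at this lineage with toDebugString (a real RDD method in pyspark)
In [ ]:
# description of the RDD and its recursive dependencies
rdd_result.toDebugString()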
We can take a look at the contents of the results by using take
again. Since take
is an action, it will trigger a job in the Spark cluster
In [ ]:
rdd_result.take(3)
In [ ]:
rdd_result.count()
In [ ]:
rdd_result.first()
Usually, once you have your results, you write them back to Hadoop for later processing, because they usually won't fit in memory.
In [ ]:
# this function can save into HDFS using Pickle (Python's internal) format
# saveAsPickleFile requires a destination path; the one below is only illustrative
rdd_result.saveAsPickleFile('rdd_result_pickle')
Now, DataFrames have some structure. Again, you can create them from different sources. In this case, DataFrame functionality is available from another context called the sqlContext, which gives us access to SQL-like transformations.
In this example, we will use the sklearn
diabetes dataset again
In [ ]:
from sklearn.datasets import load_diabetes
import pandas as pd
In [ ]:
diabetes_ds = load_diabetes()
To create a dataset useful for machine learning, we need to use certain data types
In [ ]:
from pyspark.mllib.regression import LabeledPoint
In [ ]:
from pyspark.ml.linalg import Vectors
In [ ]:
Xy_df = sqlContext.createDataFrame(
    [[float(l), Vectors.dense(d)] for d, l in zip(diabetes_ds['data'], diabetes_ds['target'])],
    ["y", "features"])
In [ ]:
Xy_df
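Since DataFrames are also lazily evaluated, displaying the variable only shows its schema; to see actual rows you can use show (a real DataFrame method)
In [ ]:
Xy_df.show(5)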
We can register the DataFrame in Spark as a SQL table
In [ ]:
Xy_df.registerTempTable('Xy')
And then run queries
In [ ]:
sql_result1_df = sqlContext.sql('select count(*) from Xy')
In [ ]:
# which again is lazily executed
sql_result1_df
In [ ]:
sql_result1_df.take(1)
We can again run large-scale regression, this time using DataFrames
In [ ]:
from pyspark.ml.regression import LinearRegression
In [ ]:
lr_spark = LinearRegression(featuresCol='features', labelCol="y")
In [ ]:
# this fails: coefficients exist only on the fitted model returned by fit
lr_spark.coefficients
In [ ]:
lr_results = lr_spark.fit(Xy_df)
In [ ]:
lr_results.coefficients
In [ ]:
lr_results.intercept
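Finally, as with sklearn, the fitted model can generate predictions. In Spark ML this is done with transform (a real method of the fitted model), which appends a prediction column to the DataFrame
In [ ]:
predictions_df = lr_results.transform(Xy_df)
predictions_df.show(5)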